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Abstract 

Background: Accurate somatic mutation-calling is essential for insightful mutation analyses in cancer studies. 
Several mutation-callers are publicly available and more are likely to appear. Nonetheless, mutation-calling is still 
challenging and there is unlikely to be one established caller that systematically outperforms all others. Therefore, fully 
utilizing multiple callers can be a powerful way to construct a list of final calls for one's research. 

Results: Using a set of mutations from multiple callers that are impartially validated, we present a statistical approach 
for building a combined caller, which can be applied to combine calls in a wider dataset generated using a similar 
protocol. Using the mutation outputs and the validation data from The Cancer Genome Atlas endometrial study 
(6,746 sites), we demonstrate how to build a statistical model that predicts the probability of each call being a somatic 
mutation, based on the detection status of multiple callers and a few associated features. 

Conclusion: The approach allows us to build a combined caller across the full range of stringency levels, which 
outperforms all of the individual callers. 
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Background 

Somatic mutations are genetic changes that occur in 
somatic cells after conception. Cancer is driven by such 
somatic alterations, and thus cataloging somatic muta- 
tions is essential to understand the genetic bases of cancer 
development. With the burst of high-throughput sequenc- 
ing data generated in recent years, extensive efforts 
have been made towards accurate somatic mutation- 
calling. Many calling algorithms are now publicly avail- 
able, including Shimmer [1], MuTect [2], Strelka [3], 
MutationSeq [4], JointSNVMix [5], and SomaticSniper 
[6]. Additional in-house callers are likely to be under 
development for on-going studies. Nonetheless, many 
challenges remain to be addressed, for example, remov- 
ing artifactual variants from multiple sources, detecting 
rare variants in highly heterogeneous tumor samples, 
detecting variants at a shallower sequencing coverage. 
Every caller will tackle these issues, but different callers 
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are likely to be more successful on some of them and 
less so on others. As a consequence, finding the single 
best performing caller is not easy, and perhaps not even 
feasible. 

Having multiple callers on hand, anyone conducting a 
mutation analysis may want to apply all of the callers to 
his/her data with the aim of later constructing a list of final 
calls. In essence, combining calls from multiple callers 
amounts to developing a strategy to sort the calls to be 
included as final calls. This can be done effectively if one 
can systematically assign a confidence measure to be a 
somatic mutation across the full list. In general, pursuing 
this goal requires a validation dataset to some extent. For 
example, the paper by Lower et al. [7] presented a method 
to prioritize calls from three methods by assigning false 
discovery rate confidence values, but it requires the inde- 
pendent sequencing of at least one of the tumor or normal 
samples. 

In our work, we are considering a situation in which 
mutation-calling is done (by multiple callers) for many 
tumor-normal sequence pairs across a large genomic 
regions such as whole genome or exome, but only a lim- 
ited resource is available for validation. For example, in 
practice, often only a small fraction of detected mutations 
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can be validated or a small subset of regions in a selected 
list of samples are re-sequenced for evaluation purposes. 
We aim to build a combined caller, which is learned 
based on the relatively small validation dataset but can be 
applied to a wider dataset generated based on a similar 
protocol. 

A large corpus in the statistical literature is dedicated to 
combining individual learners, see e.g. Chapter 16 of [8], 
however most of them — e.g., boosting, bagging and 
random forests — are based on building individual learn- 
ers from descriptors rather than combining outputs of 
algorithms. Stacking [9] was introduced as a mean of com- 
bining such outputs. In this paper, we exploit this well 
established framework to merge the outputs of different 
callers. 

Specifically, we present a statistical approach for com- 
bining calls from multiple somatic mutation-callers, when 
validation is impartially done for all mutations detected 
by all callers in a selected set of regions or samples. 
For 194 tumor-normal exome-seq pairs from The Can- 
cer Genome Atlas (TCGA) endometrial study [10], single 
nucleotide variant (SNV) type mutations (i.e., point muta- 
tions) were detected by three somatic mutation-callers. 
Validation through an independent re-sequencing was 
impartially done for all the mutations detected from 20 
selected patients across the whole exome and for those 
mutations detected within 243 genes of interest across all 
194 patients. We used this data to show how our statisti- 
cal approach improves against individual callers and naive 
combination based on caller intersection. We also show 
that this improvement is maintained when the parame- 
ters of the model are estimated on a set of samples or 
regions different from the ones on which the performance 
is evaluated. 

Methods 

Our aim is to build a combined caller using the mutation 
outputs generated by multiple callers based on the same 
paired tumor-normal sequence data (BAMs; [11]), when 
the mutation calls are impartially validated. For illustra- 
tion purposes, we assume K = 3 callers (Caller A, B, and 
C) are used for mutation-calling. The most basic and key 
information available in each mutation output is the list of 
positions detected as point mutations. A mutation output 
may include additional features such as mutation substi- 
tution type, mutation quality score, and perhaps details 
of filters applied to remove artifactual or low-quality vari- 
ants. When the raw sequence data are available, genomic 
features can be computed for each mutation site such 
as sequencing depth and the variant allele fraction (the 
fraction of reads carrying the variant allele) for each 
tumor and normal sample. The more information that is 
available, the more powerful are the callers that can be 
constructed. 



Taking intersections or unions 

One natural and simple way to build a combined caller 
is to take intersections or unions of the calls from three 
callers as final calls. For example, one may take the muta- 
tions detected by all callers (ABC), or take intersections of 
mutations from two callers (AB, AC, or BC), or take calls 
detected by at least two callers C2orMore'), or even take 
calls detected by any caller (Union). This strategy is very 
intuitive and can be immediately used in practice once a 
Venn diagram is drawn from calls, as in Figure 1. Note that 
building this type of combined caller does not require a 
validation dataset — although estimating its performance 
does. 

Cumulatively adding mutation sets based on combination 
call status 

We explained how the sets of mutation sites defined by a 
Venn diagram could be used to build a combined caller. 
Restricting ourselves to mutation sets corresponding to 
a combination of detection statuses of the K callers, we 
obtain a partition of the mutation sites into 2^'^ — 1 dis- 
joint subsets. This partition can be used to systematically 
sort mutations by some measure of confidence that we 
have in their being somatic mutations. On Figure 1, these 
2^ - 1 = 7 disjoint sets are ABC, AB without C, AC with- 
out B, BC without A, A only, B only, and C only. We sort 
these 2^^ — 1 disjoint sets by their validation rate, i.e., by the 
proportion of true mutations that they contain, as shown 
on Table 1. These sorted sets of sites define a sequence 



Caller A Caller B 




Caller C 

Figure 1 Venn diagram of the point mutations detected by three 
callers on 20 TCGA endometrial tumor-normal exome-seq pairs. 
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Table 1 Validation results of the seven disjoint mutation 
sets shown in Figure 1 



Combination 


vai. 


CD 

rr 


TD 

TP 


Crr 


CTr 


call status 


rate (%) 


count 


count 


rate 


rate 


All callers 


99.4 


12 


1,914 


1.2 


55.3 


Caller A and C only 


96.4 


11 


294 


2.4 


63.8 


Caller A and Bonly 


96.3 


7 


184 


3.1 


69.1 


Caller B and C only 


94.4 


2 


34 


3.3 


70.1 


Caller C only 


79.6 


11 


43 


44 


71.3 


Caller A only 


59.7 


632 


935 


69.1 


98.4 


Caller B only 


15.9 


302 


57 


100 


100 



For each mutation set (row), the validation rate (Val. rate), the false positive (FP) 
and true positive (TP) counts, and the cumulative false positive (cFP) and 
cumulative true positive (cTP) rates in percentage, are presented. Mutation sets 
are ordered by the validation rate. 



of combined callers, sorted by stringency. The most strin- 
gent combined caller only predicts the site in the first set 
to be mutations. Then less stringent combined callers can 
be defined by adding the sites in the sorted sets. 

Fitting logistic models using the call status and genomic 
features 

Stacked generalization was first introduced in the neural 
network community [12] and later adapted to the statistics 
literature [9], as a systematic way to combine classifiers. 

Given a set of calls cij^ e {0, 1} for site 1 < i < n 
and caller 1 < k < K, stacking aims at building a linear 
function of the calls for each site / which predicts its true 
status yi as accurately as possible. In other words, we rep- 
resent each site by its K calls from the different callers, 
and learn a new classifier of mutation sites in this feature 
space. Formally, given a set of n sites with known calls ci^ 
for all callers and known true status j^, a linear stacking 
approach would solve: 



arg min Y^Ui-Yl PkCik 
^i,...A< i=i \ f,=i 



(1) 



i,e,y a linear regression in the call space, estimating weights 
such that a linear combination of the calls based on 
these weights is close to the true mutation status. The 
mutation status of a new site Ci defined by its calls from 
the K individual callers would then be predicted via 



f(Ci) = ^ fikCik' 
k=l 



(2) 



In practice, we use a logistic model rather than a lin- 
ear one, because it is better suited to binary classification 
[8] - we only have binary mutation status {0, 1} as opposed 



to scores or continuous confidence measures. Our estima- 
tor therefore becomes: 



arg min 



log( l + expj^^K,-;^ 



- yi XI PkCik [ 
k=i J 



(3) 



If the features cn^ are binary, which is the case if the 
individual callers returned binary decisions rather than 
continuous scores, the resulting classifier/(c/) is the sum 
of weights P/^ for callers which classified the site / as a 
somatic mutation. It can only take 2^^ — 1 distinct values 
on sites which were called by at least one caller. Each of 
these values corresponds to a unique combination of calls 
by the individual methods, which in turn corresponds 
to one of the disjoint subsets defined by the Venn dia- 
gram discussed in Section 'Cumulatively adding mutation 
sets based on combination call status'. If the effects of 
callers are additive, then the ranking of the sites defined 
by/ is expected to essentially behave like the more naive 
one defined in Section 'Cumulatively adding mutation sets 
based on combination call status'. 

The estimators defined by (1) and (3) combine the 
individual callers uniformly for all sites. It is however con- 
ceivable that some callers perform better for some types of 
sites, e.g., those with low coverage, and less well for others. 
We now assume that some descriptors gij, 1 of 
each site i are available besides the detection status of the 
three callers and the validation status. These descriptors 
could typically be genomic features. 

Feature-weighted linear stacking (FWLS, [13]) replaces 
each parameter fij^ of the stacking regression estimator (3) 
by a linear combination of the descriptors gij: 



Pk = 



(4) 



where the aj/^ parameters are weights corresponding to the 
relevance of feature gij to measure how predictive caller 
k is for site /. The weights are therefore site-specific, 
accounting for the fact that the relevance of a particular 
caller k may be different for sites with different genomic 
features. 

Plugging weights (4) in the linear function (2) yields a 
different set of coefficients for each site / : h(ci,gi) = 
ELi PkCik = ELi E/Li ^jkgijCih h is now a linear func- 
tion of the K X p products of features gij and calls c^^ so 
FWLS equivalently amounts to: 

(i) describing each site by this extended set of features, 
and 

(ii) estimating a linear classifier of mutation sites in this 
space. 
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Kxp 




log ^1 + exp ' 




1) 




1=1 






Formally, after plugging (4) in our stacking estimator (3) 
we see that FWLS solves: 



argmin 



where xu e R^"^^^ contains all the products of calls and 
genomic features for site /. The Kxp parameters y/ are 
the weights of the logistic regression. They are strictly 
equivalent to the aj/^ parameters of (4), we only use them 
to emphasize that FWLS can be formulated as a regular 
logistic regression estimator in an expanded feature space. 

In the experiments of this paper, we consider all com- 
binations of call status defined in Section 'Cumulatively 
adding mutation sets based on combination call status', 
i.e., all products of single calls rather than the single calls. 
Technically this can still be cast as a FWLS model, by 
adding all single calls and products of single calls to the 
set of features gij. In practice, our implementation reUes 
on (5), i.e., on a logistic regression in an expanded feature 
space. 

Finally, since the resulting feature space can become 
large, we choose to use an £i -penalized version of (5): 



arg min ^ 



\ ( 


Kxp 


\ Kxp 


Kxp 


log! 1+exp 










1=1 




1=1 



(6) 



Penalizing the ii norm Ylf=f IK/I ^^e parameter is 
known to lead to sparse estimators [14], and A g M is used 
to adjust the level of sparsity. 

Implementation and evaluation of combined callers 

The approach of building a combined caller by taking 
intersections or unions (Section Taking intersections or 
unions') does not require a training set, and evaluation 
of the caller can be done straightforwardly on a test 
set. The approach that cumulatively adds disjoint sub- 
sets (Section 'Cumulatively adding mutation sets based on 
combination call status') uses a training set to determine 
the order of subsets (by computing the validation rate of 
each subset), and evaluates the performance on a test set 
using the order. For the approach building a caller by fit- 
ting a logistic model (Section 'Fitting logistic models using 
the call status and genomic features'), a training set is used 
to estimate the yi parameters of (6). In order to choose 
the hyperparameter A, we perform 10-fold cross valida- 
tion on the training set for each candidate A to estimate 
the error of the associated model. Then the most parsi- 
monious model whose error is no more than one standard 
error above the error of the best model is chosen. Once 
A is selected, we re-estimate // using this A on the whole 



training set, and evaluate its performance on the test set. 
Experiments were conducted using the R package glmnet 
[15], which implements penalized GLMs, in particular the 
li penalized logistic regression of which (6) is an instance. 
The R scripts that contain our detailed implementation 
are included as Additional file 1. 

Results 

We have used the mutation datasets generated for the 
TCGA endometrial study [10]. For 194 tumor-normal 
Illumina exome-sequence pairs, somatic-mutation calling 
was done by three centers whose algorithms are referred 
to here as Caller A, B, and C. In total, 51,648 single 
nucleotide variant (SNV) type of mutations were detected. 
A large fraction of the mutations were targeted for cus- 
tom capture validation. As explained in the Additional 
file 2: Supplementary Methods, these sites were captured 
using the Nimblegen technology and then re-sequenced 
independently using an Illumina HighSeq 2000. In partic- 
ular, impartial validation (i.e. validating all calls from all 
callers) was carried out for all mutations in (1) a randomly 
selected 20 patients and (2) an additional 243 genes of 
interest from the remaining 174 patients. VaUdation sta- 
tus was successfully determined for all but a small fraction 
(less than 5%) of the validated mutations. For more details 
about the validation and determining the validation sta- 
tus, see Additional file 2: Supplementary Methods. Our 
final dataset consists of the successfully validated muta- 
tions: (1) 4,438 sites in the selected 20 patients and (2) 
an additional 2,308 sites within the 243 genes of interest. 
Note that almost all of these sites (> 95%) are included 
as example datasets in our software package (Additional 
file 1). 

For each point mutation site in our final dataset, we 
know the validation status (somatic' or 'non-somatic'), the 
call status (i.e., whether or not it was detected) by each of 
the three callers, the mutation substitution type (combina- 
tion of the reference allele and the variant allele), and the 
sequencing depth and the variant allele fraction in each 
tumor and normal sample based on the exome sequence 
data that was used for mutation-calling. A brief summary 
of our dataset is included as Table 1, Additional file 2: 
Table SI and Figures S1-S4. Caller B provided more infor- 
mation besides the positions of the detected mutations. 
For a broader set of somatic variants (candidate muta- 
tions), it reported the mutation quality score as well as the 
pass/fail status of individual filters at each site. Although 
the detailed description of each filter was not available, 
the filter outcomes were available (Additional file 2: Table 
S2), which we were able to use for improving Caller B's 
performance (Section 'Improving a single caller's perfor- 
mance using details of its filters'). In Section 'Building 
and evaluating combined callers', we demonstrate how to 
build a combined caller using the calling status of the three 
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individual mutation callers and a few genomic features. 
In Section Improving a single caller s performance using 
details of its filters', we show the potential for improv- 
ing the performance of an individual caller using more 
detailed outputs, using Caller B as an instance. 

Building and evaluating combined callers 

We first used the mutations detected from the 20 selected 
patients (total: 4,438) to build and evaluate combined 
callers. Assuming (for illustrative purposes) that the char- 
acteristics of our mutations are not affected by sample- 
specific features, we randomly split the data into 50% 
training and 50% test sets. Other fractions were explored, 
but the qualitative conclusions were similar as long as 
there was enough data to train the model, e.g., more than 
20% of the total. 

The performance of the combined caller constructed by 
fitting a logistic model (defined in Section 'Fitting logis- 
tic models using the call status and genomic features') is 
shown as a receiver operating characteristic (ROC) curve 
in Figure 2. The explanatory variables for this logistic 
model consist of the combination call status (7 — 1 vari- 
ables), sequencing depth and variant allele fraction in each 
tumor and normal sample (4 variables), mutation substi- 
tution type (12 — 1 variables), and interactions between 
the combination call status variables and other features 
(90 variables). Note that we used combination call status 
(7 — 1 variables) instead of the call status of each individ- 
ual caller (3 variables) as shown in (6) in Section 'Fitting 
logistic models using the call status and genomic features'. 
We used the combination call status, since we do not want 
to assume that the effects of callers are necessarily addi- 
tive. For example, in reality, a certain sequence feature may 
mislead two callers, but the remaining single caller may 
have a better filter for it. Therefore rather than imposing 
additivity, we would like to characterize each combination 
call status separately. The model fitting was done based 
on a randomly selected 50% training sites, then predic- 
tion was made on the remaining 50% test sites, enabling 
us to sort the mutations. A more stringent caller can be 
constructed by taking a smaller percentage of high-ranked 
mutations as final calls, and a more liberal caller can be 
constructed by including a larger percentage of mutations 
as final calls. 

The performances of individual callers and combined 
callers are summarized in Figure 2. Note that validation 
was done only for the mutations that were detected by at 
least one of the three callers, and therefore, the union of 
all mutations comprises all true positives and all false pos- 
itives. The results of three individual callers are given at 
three points with different false positive rates, i.e., differ- 
ent stringency levels. Caller A is the most liberal in the 
sense that it detected many false positives (FP rate at 68%) 
but also detected most of the true positives (TP rate at 



A Callers 

• Cumulatively add sets 
■ Our combined caller 




OiHfMro'^in(^i^ooa>i-i 
ddddodocici 

False positive rate 

Figure 2 Performances of individual and combined callers. 

Model fitting was done using a random 50% of the point mutations 
detected from the selected 20 patients, and evaluation was done 
based on the remaining half. In the main panel, the true positive and 
the false positive rates of various callers are shown: (1) three individual 
callers (red filled triangles): Caller A, Caller B, and Caller C, (2) the caller 
that cumulatively adds mutation sets based on the combination call 
status in the order of the validation rate (connected blue dots), (3) the 
combined caller built by fitting a logistic model (for details, see text) 
(green line). The area near the point showing Caller C's performance 
is enlarged and shown as a small sub-panel on the lower right part of 
the main figure. This panel further indicates the performance of the 
callers that take unions or intersections of calls from three callers 
(brown diamonds): all callers (ABC), intersections of two callers (AB, 
AC, or BC), called by more than two callers ('2orMore'). 



96%). Caller C has a very small FP rate (4%) but detected 
only 67% of the true positives. Caller B performs poorer 
than Caller C, since it detected not only more false posi- 
tives but also less true positives. The performance of the 
caller taking unions or intersections of the calls is marked 
as another set of points, inside of the sub-panel on the 
lower right part of the main panel. The stringency levels of 
these callers are not necessarily ordered. For example, the 
set of mutations called by two or more callers (2orMore) 
is nested within any intersection of two callers (AB, AC, or 
BC), but no ordering exists among the latter three inter- 
sections. In contrast to this, the performance of the caller 
adding mutations sets cumulatively is shown as a con- 
nected set of blue dots because of the natural ordering 
determined based on the validation rates. In reality, the 
ordering may not be the same between the training set 
and the test set. When the validation rates are very similar 
among the mutation subsets or the number of mutations 
in each set is very small, sampling variation could easily 
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result in a different ordering. In the training set, the vali- 
dation rates of the mutation set called by A and C but not 
B, and the set called by A and B but not C, are 97.99% and 
97.96%, respectively. 

Overall, our combined caller obtained by fitting a logis- 
tic model outperforms the individual callers and other 
naive combinations. The ROC curve of this combined 
caller is above of all the points representing the per- 
formance of individual callers, although sometimes only 
slightly so. Further, the combined caller allows us to assess 
the performance across the full range of stringency levels. 

Improving a single caller's performance using details of its 
filters 

For Caller B, mutation quality scores as well as the out- 
comes of individual filters were available for a broader set 
of somatic variants. (Note that for each caller, the detected 
mutations are the somatic variants that passed all the 
filters implemented by that caller.) In Figure 2, the perfor- 
mance of Caller B was shown as a single point. Here, we 
demonstrate how such extra details besides the call status 
can be used to improve the performance. Furthermore, to 
prove the validity of our approach in a wider dataset, we 
trained and tested on two different mutation datasets that 
were generated for the TCGA endometrial study using the 
same mutation calling algorithms, but constructed from 
different genomic regions as well as different tumor and 
normal samples. Specifically, we trained a model on the 
mutations from the 243 genes of interest from 174 patients 
(our second dataset described at the beginning of Section 
'Results'), then evaluated on the mutations from the whole 
exomes of the 20 patients (first dataset). A similar analysis 
was performed with the roles of the two datasets switched 
(Additional file 2: Figure S5). 

Since a mutation quality score was available for Caller 
B, we first drew an ROC curve by sorting the calls that 
were detected by Caller B (Figure 3). As expected, the 
right most point in the ROC curve (besides the one at 
the FP rate of 1.0) corresponds to the point for which 
Caller B was previously evaluated. We then fitted a logistic 
model including the mutation quality score and the indi- 
vidual filter outcomes (indicator of pass/fail) from Caller 
B as explanatory variables. The estimated coefficients for 
the individual filters are summarized in Additional file 2: 
Table S2 (note that these coefficients were estimated from 
a set of ascertained sites for which each site was called by 
at least one of the three callers). 

By utilizing the outcomes of individual filters. Caller 
Bs performance has improved substantially (Figure 3). 
At a false positive rate of 33%, the true positive rate 
increases from 63% to 78%, detecting 520 more muta- 
tions. This highlights the importance of having the full 
details of all features involved in the final decision on a 
variant. 



0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 




Individual callers 

Cumulatively add sets 

Our combined caller 

Updated Caller B using quality score 

Updated Caller B using quality score and filters 



False positive count 

Figure 3 ROC curve of an improved Caller B built by fitting a 
logistic model using the mutation quality score and individual 
filters of Caller B. Model fitting was done using the point mutations 
in 243 genes of interest from 1 74 patients excluding the 20 patients, 
and evaluation was done on the point mutations in the 20 selected 
patients. The performances of three individual callers (red filled 
triangles), the combined caller that cumulatively adds mutation sets 
(connected blue dots), and the combined caller by fitting the logistic 
model (green lilne) are shown for comparison purposes. ROC curves 
of two updated versions of Caller B are shown. One version is 
obtained by ranking the mutations detected by Caller B using the 
mutation quality score of Caller B (violet line), and the other version 
by fitting a logistic model using the mutation quality score and the 
individual filters of Caller B on an extended set of mutations that were 
detected by at least one of the three callers (orange line). 



Furthermore, if similar details were available for Caller 
A and C, then we could generalize the logistic model in 
previous section (Section 'Building and evaluating com- 
bined callers') including outcomes of individual filters 
from all callers, which potentially leads to a higher power 
as well as better insight on the cause of mutation-calling 
errors. 

Discussion 

In this paper, we present an approach for effectively build- 
ing a combined caller using the outputs from three muta- 
tion callers. Our approach is valid with more than three 
callers or less concordant mutation call outputs, as long 
as impartial validation data is available for all calls from 
all mutation callers as a training data, and the relative per- 
formance of individual callers is expected to be consistent 
between the training set and the test set. The combining 
approach could be even more beneficial if the individual 
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callers agreed less — assuming (i) they all had compa- 
rable individual performances and (ii) the set of loci on 
which each caller is right could be characterized in terms 
of genomic features. In this case, the FWLS approach 
could learn the type of locus on which each caller is typi- 
cally right and output the best answer for each new locus, 
resulting in a more accurate calling. 

We have analyzed mutation sites that were successfully 
validated based on the criteria described in Additional 
file 2: Supplementary Methods. Those validation crite- 
ria may not be perfect, but we found them reasonable to 
demonstrate our approach. Changes in validation crite- 
ria can result changes in individual callers' performances 
and thus the final model estimated. For example, more 
stringent criteria are likely to treat all very rare muta- 
tions as false calls, and thus in our exercise, may reduce 
the sensitivity of Caller A to a large extent. However, our 
approach remains to provide a convenient framework to 
build the best combined model, given any validation sta- 
tus. In practice, determining validation status based on 
an independent sequencing data can be very challenging, 
and developing highly accurate validation method itself 
is another research topic. Working on better validation is 
out of scope for our paper, but if uncertainty in the vali- 
dation could be quantified, it could be used in the logistic 
model fitting to weight more accurate calls. 

In practice, an effective validation strategy is essential 
for building a successful model. In principle, a training 
dataset is supposed to contain all sites characterizing a 
wider dataset for which one wishes to apply the estimated 
model. Therefore, a validation dataset needs to include 
enough sites to learn the behaviors of the mutation-calling 
algorithms across a broad spectrum of genomic features. 
Another important issue is to have impartially validated 
sites. If validation is done partially, then the composi- 
tion of a training dataset is biased and thus the estimated 
parameters and the performance are also biased. 

Conclusions 

Our approaches provide a unified framework for dealing 
with multiple somatic-mutation callers. If the callers pro- 
vide only the list of positions detected as mutations, then 
it is difficult to compare them, or to investigate the trade- 
off between the stringency of the calling-procedure and 
its power to detect true mutations. Our combined caller 
can be used to overcome these difficulties. It offers an 
evaluation of its performance across the full range as an 
ROC curve, and in addition, allows easy comparison with 
individual callers. 

Furthermore, we have shown that it is feasible to build a 
combined caller that performs better than all the individ- 
ual callers, one which could be better (even slightly) than 
a caller combining calls only based on the detection sta- 
tus. An even more powerful caller can possibly be built 



when more features associated with calling performance 
are available, such as individual details of the filters used 
by each caller or the measure of strand bias. 

Finally, we demonstrate the potential for building a com- 
bined caller using a small validation dataset (generated 
for a subset of regions or samples in the original study), 
which can be applied to a wider dataset to assign a confi- 
dence measure that can be used for ranking the mutations 
from multiple callers. Our two mutation datasets, one 
from the selected 20 patients and the other from 243 genes 
of interest across 174 patients share protocols (sample 
preparation, sequencing technology, alignment methods, 
and the applied mutation-calling algorithms) but differ for 
genomic regions and the tumor and normal samples used 
for calling. The results from training the model using one 
of the datasets and evaluating on the other suggest that the 
estimated models based on these validation datasets are 
generally applicable to the mutations from whole exomes 
of all 194 endometrial patients. 

Additional files 



Additional file 1 : Software package. A .tar.gz file that contains R scripts 
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Supplementary Methods, Tables and Figures. 
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